Computing q-gram Frequencies on Collage Systems

Authors

  • Keisuke Goto
  • Hideo Bannai
  • Shunsuke Inenaga
  • Masayuki Takeda
Abstract

Collage systems are a general framework for representing outputs of various text compression algorithms. We consider the all q-gram frequency problem on a compressed string represented as a collage system, and present an O((q + h log n)n)-time O(qn)-space algorithm for calculating the frequencies for all q-grams that occur in the string. Here, n and h are respectively the size and height of the collage system.
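To make the problem statement concrete, here is a minimal sketch of the all q-gram frequency problem on an uncompressed string. It is a naive baseline only, not the collage-system algorithm the abstract describes: it scans the fully expanded text, which is exactly the cost the compressed-domain algorithm is meant to avoid. The function name `qgram_frequencies` is hypothetical and chosen for illustration.

```python
from collections import Counter


def qgram_frequencies(text: str, q: int) -> Counter:
    """Count every length-q substring (q-gram) of an uncompressed text.

    Naive baseline for illustration: it slides a window over the
    expanded string and does not operate on a collage system.
    """
    if q <= 0:
        raise ValueError("q must be positive")
    # One window per starting position; if q > len(text) the range is empty.
    return Counter(text[i:i + q] for i in range(len(text) - q + 1))


if __name__ == "__main__":
    for gram, count in sorted(qgram_frequencies("abababb", 2).items()):
        print(gram, count)
```

For the string "abababb" and q = 2, the windows are ab, ba, ab, ba, ab, bb, so the counts are ab: 3, ba: 2, bb: 1. Overlapping occurrences are counted, which matches the usual meaning of q-gram frequency.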

Similar resources

Computing q-Gram Non-overlapping Frequencies on SLP Compressed Texts

Length-q substrings, or q-grams, can represent important characteristics of text data, and determining the frequencies of all q-grams contained in the data is an important problem with many applications in the field of data mining and machine learning. In this paper, we consider the problem of calculating the non-overlapping frequencies of all q-grams in a text given in compressed form, namely, ...

Speeding Up q-Gram Mining on Grammar-Based Compressed Texts

We present an efficient algorithm for calculating q-gram frequencies on strings represented in compressed form, namely, as a straight line program (SLP). Given an SLP T of size n that represents string T, the algorithm computes the occurrence frequencies of all q-grams in T, by reducing the problem to the weighted q-gram frequencies problem on a trie-like structure of size m = |T| − dup(q, T...

Chipping Away at Censorship Firewalls with User-Generated Content

Oppressive regimes and even democratic governments restrict Internet access. Existing anti-censorship systems often require users to connect through proxies, but these systems are relatively easy for a censor to discover and block. This paper offers a possible next step in the censorship arms race: rather than relying on a single system or set of proxies to circumvent censorship firewalls, we e...

Gene Prioritization by Compressive Data Fusion and Chaining

Data integration procedures combine heterogeneous data sets into predictive models, but they are limited to data explicitly related to the target object type, such as genes. Collage is a new data fusion approach to gene prioritization. It considers data sets of various association levels with the prediction task, utilizes collective matrix factorization to compress the data, and chaining to rel...

COMPUTING THE EIGENVALUES OF CAYLEY GRAPHS OF ORDER $p^2q$

A graph is called symmetric if its full automorphism group acts transitively on the set of arcs. The Cayley graph $\Gamma=\mathrm{Cay}(G,S)$ on group $G$ is said to be normal symmetric if $N_A(R(G))=R(G)\rtimes \mathrm{Aut}(G,S)$ acts transitively on the set of arcs of $\Gamma$. In this paper, we classify all connected tetravalent normal symmetric Cayley graphs of order $p^2q$ where $p>q$ are prime numbers.

Journal title:
  • CoRR

Volume abs/1107.3019  Issue

Pages  -

Publication date 2011